Star Hotels Project

Context

A significant number of hotel bookings are called off due to cancellations or no-shows. Typical reasons include changes of plans and scheduling conflicts. Cancelling is often made easier by the option to do so free of charge, or at a low cost, which is convenient for hotel guests but is an undesirable and potentially revenue-diminishing factor for hotels. Such losses are particularly high for last-minute cancellations.

New technologies such as online booking channels have dramatically changed customers' booking possibilities and behavior. This adds a further dimension to the challenge of handling cancellations, which are no longer driven only by traditional booking and guest characteristics.

The cancellation of bookings impacts a hotel on various fronts:

Objective

The increasing number of cancellations calls for a machine-learning-based solution that can predict which bookings are likely to be canceled. Star Hotels Group, a chain of hotels in Portugal, is facing a high number of booking cancellations and has reached out to your firm for data-driven solutions. As a data scientist, you have to analyze the data provided to find which factors have a high influence on booking cancellations, build a predictive model that can predict in advance which bookings will be canceled, and help formulate profitable cancellation and refund policies.

Data Description

The data contains the different attributes of customers' booking details. The detailed data dictionary is given below.

Data Dictionary

Importing necessary libraries and data

Load the dataset

Data Overview

View the first and last 5 rows of the dataset.

Understand the shape of the dataset.

Check the data types of the columns for the dataset.
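The overview steps above can be sketched as follows. Since the actual Star Hotels file isn't shown here, a tiny synthetic frame (column names taken from the data dictionary) stands in for the loaded `df`:

```python
import pandas as pd

# Illustrative stand-in for the bookings data; in the notebook, `df` would
# come from pd.read_csv on the Star Hotels file (path and contents assumed).
df = pd.DataFrame({
    "lead_time": [24, 5, 120, 60],
    "avg_price_per_room": [85.0, 120.5, 99.9, 75.0],
    "booking_status": ["Not_Canceled", "Canceled", "Not_Canceled", "Canceled"],
})

print(df.head())   # first 5 rows
print(df.tail())   # last 5 rows
print(df.shape)    # (number of rows, number of columns)
print(df.dtypes)   # data type of each column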

Observations:

Data Preprocessing

Let's check for duplicate data and, if any is found, remove it.

Let's drop the duplicate values

Check for missing values

Fixing the datatypes
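The three preprocessing checks above (duplicates, missing values, datatypes) can be sketched as below, again on a small synthetic frame since the real data isn't reproduced here:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "arrival_year": ["2017", "2018", "2018", "2018"],   # wrongly stored as strings
    "avg_price_per_room": [85.0, np.nan, 99.9, 110.0],  # one missing value
    "booking_status": ["Canceled", "Not_Canceled", "Canceled", "Canceled"],
})
df = pd.concat([df, df.iloc[[0]]], ignore_index=True)   # inject one duplicate row

print(df.duplicated().sum())                 # count of exact duplicate rows
df = df.drop_duplicates(ignore_index=True)   # keep the first occurrence only

print(df.isnull().sum())                     # missing values per column

# Fix a wrong dtype: arrival_year should be numeric, not object.
df["arrival_year"] = df["arrival_year"].astype(int)
```

In the notebook the same pattern is applied column by column wherever `df.dtypes` disagrees with the data dictionary.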

Summary of the dataset

Observations:

Observations:

Exploratory Data Analysis (EDA)

Univariate Analysis

Let us explore the numerical variables first

Univariate Analysis of lead_time

Observations

Univariate Analysis of no_of_previous_bookings_not_canceled

Observations

Univariate Analysis of avg_price_per_room

Observations

Univariate Analysis of no_of_adults

Observations

Univariate Analysis of no_of_children

Observations

Univariate Analysis of no_of_weekend_nights

Observations

Univariate Analysis of no_of_week_nights

Observations

Univariate Analysis of required_car_parking_space

Observations

Univariate Analysis of arrival_year

Observations

Univariate Analysis of arrival_month

Observations

Univariate Analysis of arrival_date

Observations

Univariate Analysis of repeated_guest

Observations

Univariate Analysis of no_of_previous_cancellations

Observations

Univariate Analysis of no_of_special_requests

Observations

Let us now explore the categorical variables

Univariate Analysis of type_of_meal_plan

Observations

Univariate Analysis of room_type_reserved

Observations

Univariate Analysis of market_segment_type

Observations

Univariate Analysis of booking_status

Observations

Bivariate Analysis

Plot bivariate charts between numeric variables to understand their interaction with each other.

Correlation
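A minimal sketch of the correlation step, using synthetic columns whose names follow the data dictionary (the deliberately injected positive relationship between `lead_time` and `avg_price_per_room` is illustrative only, not a claim about the real data):

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(0)
lead_time = rng.integers(0, 400, 200)
df = pd.DataFrame({
    "lead_time": lead_time,
    "avg_price_per_room": 100 + 0.1 * lead_time + rng.normal(0, 10, 200),
    "no_of_special_requests": rng.integers(0, 4, 200),
})

# Pairwise Pearson correlations between the numeric variables.
corr = df.corr()
print(corr.round(2))
```

In the notebook this matrix is typically visualized with a seaborn heatmap; the numbers come from the same `df.corr()` call.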

Observations

Bivariate Scatter Plots

Observations

Relationship of Room Price and Market Segment

Observations

Relationship of repeated_guest and booking_status

Observations

Relationship of no_of_special_requests and booking_status

Observations

Relationship of no_of_adults and booking_status

Observations

Relationship of no_of_children and booking_status

Observations

Relationship of no_of_weekend_nights and booking_status

Observations

Relationship of no_of_week_nights and booking_status

Observations

Relationship of required_car_parking_space and booking_status

Observations

Relationship of lead_time and booking_status

Observations

Relationship of arrival_year and booking_status

Observations

Relationship of arrival_month and booking_status

Observations

Relationship of arrival_date and booking_status

Observations

Relationship of no_of_previous_cancellations and booking_status

Observations

Relationship of no_of_previous_bookings_not_canceled and booking_status

Observations

Relationship of avg_price_per_room and booking_status

Observations

Relationship of type_of_meal_plan and booking_status

Observations

Relationship of room_type_reserved and booking_status

Observations

Relationship of market_segment_type and booking_status

Observations

Insights Based on EDA

Data Preprocessing (contd.)

Let's see if there are any rows with both no_of_weekend_nights and no_of_week_nights equal to 0, and remove them.

Let's see if there are any rows with both no_of_adults and no_of_children equal to 0.

Let's see if there are any rows with no_of_adults equal to 0 and a nonzero no_of_children.

Since we do not know the ages of the children, and these bookings may have been made under special conditions, we will not remove these rows.
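The three checks above can be sketched with boolean masks; the four-row frame is a made-up example covering each case:

```python
import pandas as pd

df = pd.DataFrame({
    "no_of_weekend_nights": [0, 2, 0, 1],
    "no_of_week_nights":    [0, 3, 2, 0],
    "no_of_adults":         [2, 0, 0, 1],
    "no_of_children":       [0, 0, 2, 1],
})

# Bookings with zero total nights look invalid, so we drop them.
zero_nights = (df["no_of_weekend_nights"] == 0) & (df["no_of_week_nights"] == 0)
df = df[~zero_nights].reset_index(drop=True)

# Bookings with no guests at all, and child-only bookings, are only inspected.
no_guests  = (df["no_of_adults"] == 0) & (df["no_of_children"] == 0)
child_only = (df["no_of_adults"] == 0) & (df["no_of_children"] > 0)
print(no_guests.sum(), child_only.sum())
```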

Outlier Detection

Outlier Treatment

Building a Logistic Regression model

Data Preparation

Let's convert the booking_status variable to a numerical variable.

Before we proceed to build a model, we'll have to encode categorical features.
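Both steps can be sketched as below; mapping the target so that 1 means "canceled" (the class we want to catch) and one-hot encoding the categoricals with `pd.get_dummies`:

```python
import pandas as pd

df = pd.DataFrame({
    "booking_status": ["Canceled", "Not_Canceled", "Canceled"],
    "market_segment_type": ["Online", "Offline", "Online"],
    "lead_time": [100, 5, 250],
})

# Target to 0/1: 1 = canceled, the event we are trying to predict.
df["booking_status"] = (df["booking_status"] == "Canceled").astype(int)

# One-hot encode categoricals; drop_first avoids the dummy-variable trap.
df = pd.get_dummies(df, columns=["market_segment_type"], drop_first=True)
print(df.columns.tolist())
```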

Split Data

We'll split the data into train and test to be able to evaluate the model that we build on the train data.
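A minimal sketch of the split, using a synthetic classification dataset in place of the encoded bookings data; stratifying on the target keeps the cancellation rate the same in both halves:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic features/target standing in for the encoded bookings data.
X, y = make_classification(n_samples=500, n_features=5, random_state=1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)
print(X_train.shape, X_test.shape)
```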

Model evaluation criterion

Model can make wrong predictions as:

  1. Predicting that a customer will not cancel the booking when in reality they do cancel (a false negative) - loss of revenue.

  2. Predicting that a customer will cancel the booking when in reality they do not (a false positive) - the hotel spends more on marketing and price reductions unnecessarily.

Which case is more important?

How can we reduce this loss, i.e., reduce the number of false negatives?

First, let's create functions to calculate different metrics and confusion matrix so that we don't have to use the same code repeatedly for each model.
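One possible shape for such a helper, built on `sklearn.metrics` (the metric set mirrors the ones discussed above, with recall as the focus):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, recall_score, precision_score,
                             f1_score, confusion_matrix)

def model_metrics(y_true, y_pred):
    """Return the classification metrics used throughout the notebook."""
    return {
        "accuracy":  accuracy_score(y_true, y_pred),
        "recall":    recall_score(y_true, y_pred),     # our focus: few FNs
        "precision": precision_score(y_true, y_pred),
        "f1":        f1_score(y_true, y_pred),
    }

# Tiny illustrative example: 4 positives, one of which is missed.
y_true = np.array([1, 0, 1, 1, 0, 1])
y_pred = np.array([1, 0, 0, 1, 0, 1])
print(model_metrics(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))   # rows: actual, cols: predicted
```

A companion plotting function (e.g. a seaborn heatmap of the confusion matrix) would follow the same pattern.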

Checking Multicollinearity

Logistic Regression (with statsmodels library)

Observations

Let's write a loop that removes the variable with the highest p-value greater than 0.05, refits the model, and repeats until no variable has a p-value greater than 0.05.

X_train5 is the final feature set, with no variables having a p-value greater than 0.05. Let's fit this model with statsmodels.

Now no feature has a p-value greater than 0.05, so we'll consider the features in X_train5 as final and lg5 as the final model.

Coefficient interpretations

Converting coefficients to odds
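Since logistic-regression coefficients are on the log-odds scale, exponentiating them gives odds ratios, and `(odds - 1) * 100` gives the percentage change in odds per unit increase of the feature. The coefficient values below are made up purely to demonstrate the transformation:

```python
import numpy as np
import pandas as pd

# Hypothetical fitted log-odds coefficients (illustrative values only).
coefs = pd.Series({"lead_time": 0.016,
                   "no_of_special_requests": -1.47,
                   "repeated_guest": -2.30})

odds = np.exp(coefs)                 # odds ratio per unit increase
pct_change = (odds - 1) * 100        # % change in odds per unit increase
print(pd.DataFrame({"odds": odds.round(3),
                    "pct_change_in_odds": pct_change.round(1)}))
```

An odds ratio above 1 means the feature pushes a booking toward cancellation; below 1 means it pushes away from it.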

Checking model performance on the training set

ROC-AUC

ROC-AUC on training set

Model Performance Improvement

Let's see if the recall score can be improved further by changing the model threshold using the AUC-ROC curve.

Optimal threshold using AUC-ROC curve
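One common way to pick the threshold from the ROC curve is Youden's J statistic, i.e. the threshold maximizing TPR - FPR. A sketch on synthetic data (the real notebook would use the fitted model's training-set probabilities):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve

X, y = make_classification(n_samples=600, random_state=3)
clf = LogisticRegression(max_iter=1000).fit(X, y)
proba = clf.predict_proba(X)[:, 1]

# Youden's J: the threshold where TPR - FPR is largest.
fpr, tpr, thresholds = roc_curve(y, proba)
optimal_threshold = thresholds[np.argmax(tpr - fpr)]
print(round(float(optimal_threshold), 2))
```

Predictions are then made with `(proba >= optimal_threshold).astype(int)` instead of the default 0.5 cutoff.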

Checking model performance on training set

Let's use the precision-recall curve and see if we can find a better threshold.
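A sketch of threshold selection from the precision-recall curve, here picking the point where precision and recall are best balanced (maximum F1); again on synthetic data with an imbalanced target as a stand-in:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve

X, y = make_classification(n_samples=600, weights=[0.67, 0.33], random_state=3)
clf = LogisticRegression(max_iter=1000).fit(X, y)
proba = clf.predict_proba(X)[:, 1]

precision, recall, thresholds = precision_recall_curve(y, proba)
# F1 at each candidate threshold; the final precision/recall point has no
# associated threshold, so it is excluded.
f1 = 2 * precision * recall / (precision + recall + 1e-12)
best = thresholds[np.argmax(f1[:-1])]
print(round(float(best), 2))
```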

Checking model performance on training set

Model Performance Summary

Let's check the performance on the test set

Using model with default threshold

ROC curve on test set

Using model with threshold = 0.33

Using model with threshold = 0.42

Model performance summary

Conclusion

Building a Decision Tree model

Checking model performance on training set

Checking model performance on test set

There is a huge disparity between the model's performance on the training set and the test set, which suggests that the model is overfitting.

Visualizing the Decision Tree

The tree is very complex and difficult to interpret.

Reducing overfitting

Using GridSearchCV for hyperparameter tuning of our tree model
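A minimal sketch of the grid search; the grid values below are illustrative choices, not necessarily the notebook's actual grid, and scoring is set to recall since false negatives are the costly error here:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, random_state=7)

# Illustrative grid; the actual notebook grid may differ.
param_grid = {
    "max_depth": [3, 5, 7],
    "min_samples_leaf": [5, 10, 20],
    "max_leaf_nodes": [10, 20, None],
}
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=7),
    param_grid,
    scoring="recall",   # tune for recall to minimize false negatives
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)
```

`grid.best_estimator_` is then the pruned tree whose performance is checked on the training and test sets below.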

Checking performance on training set

Checking model performance on test set

Visualizing the Decision Tree

Cost Complexity Pruning

The DecisionTreeClassifier provides parameters such as min_samples_leaf and max_depth to prevent a tree from overfitting. Cost complexity pruning provides another option to control the size of a tree. In DecisionTreeClassifier, this pruning technique is parameterized by the cost complexity parameter, ccp_alpha. Greater values of ccp_alpha increase the number of nodes pruned. Here we only show the effect of ccp_alpha on regularizing the trees and how to choose a ccp_alpha based on validation scores.

Total impurity of leaves vs effective alphas of pruned tree

Minimal cost complexity pruning recursively finds the node with the "weakest link". The weakest link is characterized by an effective alpha, where the nodes with the smallest effective alpha are pruned first. To get an idea of what values of ccp_alpha could be appropriate, scikit-learn provides DecisionTreeClassifier.cost_complexity_pruning_path that returns the effective alphas and the corresponding total leaf impurities at each step of the pruning process. As alpha increases, more of the tree is pruned, which increases the total impurity of its leaves.

Next, we train a decision tree using the effective alphas. The last value in ccp_alphas is the alpha value that prunes the whole tree, leaving the tree, clfs[-1], with one node.

For the remainder, we remove the last element in clfs and ccp_alphas, because it is the trivial tree with only one node. Here we show that the number of nodes and tree depth decreases as alpha increases.
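The procedure described above can be sketched as follows, using `cost_complexity_pruning_path` on a synthetic dataset in place of the bookings data:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, random_state=7)

# Effective alphas at each step of minimal cost-complexity pruning.
tree = DecisionTreeClassifier(random_state=7)
path = tree.cost_complexity_pruning_path(X, y)
ccp_alphas = path.ccp_alphas

# One tree per effective alpha; larger alpha -> more aggressive pruning.
clfs = [DecisionTreeClassifier(random_state=7, ccp_alpha=a).fit(X, y)
        for a in ccp_alphas]

# Drop the trivial single-node tree produced by the largest alpha.
clfs, ccp_alphas = clfs[:-1], ccp_alphas[:-1]
node_counts = [c.tree_.node_count for c in clfs]
print(node_counts[0], node_counts[-1])   # node count shrinks as alpha grows
```

Plotting recall against `ccp_alphas` for the training and test sets (as in the next section) then shows where pruning stops helping generalization.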

Recall vs alpha for training and testing sets

Checking model performance on training set

Checking model performance on test set

With post-pruning, we get good, generalized model performance on both the training and test sets.

Visualizing the Decision Tree

lead_time, market_segment_type_Online, and no_of_special_requests are the important features for the post-pruned tree.

Comparing all the decision tree models

Conclusion

Actionable Insights and Recommendations

Let's have a quick final overview of the data using the pandas profiling library.